[KVCache] Support only flush FD GPU Cache index by AttentionStore #7609
Conversation

Thanks for your contribution!
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@ Coverage Diff @@
##           develop    #7609   +/-   ##
==========================================
  Coverage         ?   71.66%
==========================================
  Files            ?      419
  Lines            ?    57885
  Branches         ?     9085
==========================================
  Hits             ?    41485
  Misses           ?    13569
  Partials         ?     2831
```

Flags with carried forward coverage won't be shown.
|
PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-04-28 00:33:47
📋 Review Summary

PR overview: in FD_AS_ONLY_FLUSH mode, when GPU cache blocks are evicted, flush the KV cache index through AttentionStore, skipping the actual data write and updating only the index state.

Scope of changes: fastdeploy/cache_manager/ (cache_tasks, cache_transfer_manager, prefix_cache_manager), fastdeploy/envs.py, tests/cache_manager/

Impact tag: [KVCache]

📝 PR Convention Check

The title [KVCache] Support only flush FD GPU Cache index by AttentionStore carries a valid official tag, and the description is structurally complete (Motivation / Modifications / Usage / Accuracy Tests / Checklist are all filled in). Compliant overall. ✓
Issues

| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/cache_manager/cache_transfer_manager.py:989 | write_back_storage_task lacks a backend-type guard for FD_AS_ONLY_FLUSH; under a non-attention_store backend all write operations are silently skipped |
| ❓ Question | fastdeploy/cache_manager/prefix_cache_manager.py:1452 | hash_value_flush_info keeps only the min_depth node's token_ids; confirm its length covers deeper evicted blocks |
| ❓ Question | fastdeploy/cache_manager/prefix_cache_manager.py:1251 | When is_sync=False, the _flush_only_storage_task subprocess still sends put_transfer_done_signal; confirm the consumer side carries no risk of orphaned-signal buildup |
Overall Assessment

The overall design is clear, the AS-only flush path is implemented reasonably, and the unit tests cover the core eviction scenarios. The main points to watch are the missing backend-type guard in write_back_storage_task, plus two edge cases: the token_ids coverage range and the orphaned signals. Recommend merging once the author confirms these points.
Diff excerpt:

```python
    self.storage_backend
), f"storage_backend not initialized, storage_backend_type: {self.storage_backend_type}"

if envs.FD_AS_ONLY_FLUSH:
```
🟡 Suggestion: the FD_AS_ONLY_FLUSH check in write_back_storage_task lacks a storage_backend_type filter

When FD_AS_ONLY_FLUSH=True, the current code unconditionally early-returns into _flush_only_storage_task, but that function only performs the real flush when storage_backend_type == "attention_store"; under any other backend its entire try block is a no-op. So if a user mistakenly sets FD_AS_ONLY_FLUSH=1 with a non-attention_store backend, every write operation is silently skipped and the cache is permanently lost without any error.

Consider adding a backend-type check here, or raising explicitly for non-attention_store backends inside _flush_only_storage_task:

```python
if envs.FD_AS_ONLY_FLUSH:
    if self.storage_backend_type != "attention_store":
        raise ValueError(
            f"FD_AS_ONLY_FLUSH is only supported with attention_store backend, "
            f"but got: {self.storage_backend_type}"
        )
    return self._flush_only_storage_task(task)
```

Diff excerpt:

```python
self.gpu_lru_leaf_set.remove(node)
if self.cache_config.num_cpu_blocks < need_block_num:
if node.shared_count == 0 and node.is_gpu_leaf_node:  # reclaim directly
if envs.FD_AS_ONLY_FLUSH and self.kvcache_storage_backend == "attention_store":
```
❓ Question: hash_value_flush_info keeps only the min_depth node, so token_ids comes from the shallowest node

When several nodes of different depths under the same input_hash_value are evicted in one batch (e.g. depth=2, 3, and 4 all hit), only the min_depth node's token_ids is kept, and a single flush task is sent with start_write_block_idx = min_depth - 1.

The attention_store.flush_token_index implementation was checked: its semantics are "update the state of every block from start_block_idx to the end", so one flush covers the full range from the shallowest node down to the leaf. That logic is correct.

One edge case still needs the author's confirmation: does the shallowest node's input_ids (token_ids) contain a long enough sequence for the SDK to correctly locate the deeper blocks? If input_ids only encodes up to the block at min_depth, the SDK may not be able to cover the deeper evicted range.
Diff excerpt:

```python
raise ValueError(err_msg)

self.task_write_back_event[task.task_id] = Event()
if is_sync:
```
❓ Question: when is_sync=False no Event is created, but the _flush_only_storage_task subprocess still calls put_transfer_done_signal

free_block_ids_async triggers the flush via issue_write_back_storage_task(flush_task, is_sync=False); in that case the main process neither creates a task_write_back_event entry nor waits for completion. The subprocess's _flush_only_storage_task, however, still calls put_transfer_done_signal(result) after executing, and on the main-process side that signal has no corresponding Event receiver and is silently ignored.

Under frequent GPU eviction (e.g. mass evictions after heavy prefix hits), a sizable number of orphaned signals could accumulate. Please confirm that the consumer side of put_transfer_done_signal is fully safe when it cannot find the corresponding task_id (no memory leak, no deadlock).
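One way the consumer side could be made robust against the orphaned signals described above is to drop unmatched signals explicitly. This is a hypothetical sketch, not FastDeploy's actual handler: handle_transfer_done is an invented name, and only task_write_back_event mirrors the dict seen in the diff.

```python
from threading import Event

# Mirrors the dict populated only on the is_sync=True path.
task_write_back_event: dict[str, Event] = {}


def handle_transfer_done(task_id: str) -> bool:
    """Consume one done-signal; return True if a waiter was woken,
    False if the signal was orphaned (async flush, no Event registered)."""
    event = task_write_back_event.pop(task_id, None)
    if event is None:
        # No waiter registered: discard instead of queuing, so orphaned
        # signals cannot accumulate memory or block anything.
        return False
    event.set()
    return True
```

The key property is that an unknown task_id is a cheap no-op rather than an enqueue, which is exactly the guarantee the reviewer asks the author to confirm.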
Motivation

This PR improves the FD_AS_ONLY_FLUSH flow for AttentionStore so FastDeploy can flush KV cache index state when GPU cache blocks are evicted, especially in pure-GPU cache deployments without a CPU cache. It adds the required flush metadata to support more accurate AttentionStore index updates on GPU eviction.
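As a sketch of the flush metadata this PR adds (detailed under Modifications below), the extended WriteStorageTask might look like this. The two field names come from the PR text; the class shape, other fields, and defaults are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class WriteStorageTask:
    """Hypothetical shape of the write-back task with the new flush metadata."""
    task_id: str
    gpu_block_ids: list = field(default_factory=list)
    # True when the KV cache data still resides on the current node and only
    # the AttentionStore index needs flushing (FD_AS_ONLY_FLUSH mode).
    flush_cache_exists: bool = True
    # Block index to start the partial flush/write from (e.g. depth - 1 for
    # a directly evicted GPU node).
    start_write_block_idx: int = 0
```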
Modifications

- Extended WriteStorageTask with:
  - flush_cache_exists to indicate whether cache still exists on the current node in FD_AS_ONLY_FLUSH mode.
  - start_write_block_idx to support partial flush/write from a specified block index.
- Updated the cache_transfer_manager AS-only flush path to call AttentionStore.flush_token_index(...) with both start_write_block_idx and reside_in_gpu.
- Passed FD_AS_ONLY_FLUSH to the cache transfer manager subprocess.
- Updated prefix_cache_manager.free_block_ids_async(...) to emit flush-only tasks when GPU cache blocks are directly evicted in FD_AS_ONLY_FLUSH + attention_store mode.
- Added the FD_AS_ONLY_FLUSH environment variable entry in fastdeploy/envs.py.
- Added unit tests covering flush_cache_exists=False, gpu_block_ids in flush-only mode, and start_write_block_idx=depth-1.

Usage or Command
For FD_AS_ONLY_FLUSH mode with AttentionStore:

```shell
export FD_AS_ONLY_FLUSH=1
```

Reference test command:
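Separately, the envs.py entry mentioned in Modifications can be sketched as follows. This assumes a common lambda-parsed environment-entry pattern; the actual registration style in fastdeploy/envs.py may differ.

```python
import os

# Hypothetical sketch of how the new entry could be registered.
environment_variables = {
    # When enabled (with the attention_store backend), evicted GPU cache
    # blocks only flush their AttentionStore index state instead of
    # writing data back to storage.
    "FD_AS_ONLY_FLUSH": lambda: int(os.getenv("FD_AS_ONLY_FLUSH", "0")),
}
```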
Accuracy Tests
N/A. This PR does not change model forward results or kernel numerical behavior. It only updates KV cache index flush metadata and adds unit tests for cache manager behavior.
Checklist

- Add at least a tag in the PR title, e.g. [KVCache] Support flush FD GPU/CPU Cache index by AttentionStore. Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- Format your code, run pre-commit before commit.
- If the PR targets the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag. (N/A for current develop PR)